2: College Majors And Employment

The American Community Survey is run by the US Census Bureau and collects data on everything from the affordability of housing to employment rates for different industries. For this challenge, you'll be using data derived from the American Community Survey for the years 2010-2012. The team at FiveThirtyEight has cleaned the dataset and made it available in their GitHub repo.

Here's a quick overview of the files we'll be working with:

all-ages.csv - employment data by major for all ages
recent-grads.csv - employment data by major for just recent college graduates


In [2]:
# %sh
# # download source file
# wget https://raw.githubusercontent.com/fivethirtyeight/data/master/college-majors/all-ages.csv
# wget https://raw.githubusercontent.com/fivethirtyeight/data/master/college-majors/recent-grads.csv
# ls -l

In [3]:
import pandas as pd

all_ages = pd.read_csv("all-ages.csv")
print all_ages.columns
print all_ages.head(3)

recent_grads = pd.read_csv("recent-grads.csv")
print recent_grads.columns
print recent_grads.head(3)


Index([u'Major_code', u'Major', u'Major_category', u'Total', u'Employed',
       u'Employed_full_time_year_round', u'Unemployed', u'Unemployment_rate',
       u'Median', u'P25th', u'P75th'],
      dtype='object')
   Major_code                                  Major  \
0        1100                    GENERAL AGRICULTURE   
1        1101  AGRICULTURE PRODUCTION AND MANAGEMENT   
2        1102                 AGRICULTURAL ECONOMICS   

                    Major_category   Total  Employed  \
0  Agriculture & Natural Resources  128148     90245   
1  Agriculture & Natural Resources   95326     76865   
2  Agriculture & Natural Resources   33955     26321   

   Employed_full_time_year_round  Unemployed  Unemployment_rate  Median  \
0                          74078        2423           0.026147   50000   
1                          64240        2266           0.028636   54000   
2                          22810         821           0.030248   63000   

   P25th    P75th  
0  34000  80000.0  
1  36000  80000.0  
2  40000  98000.0  
Index([u'Rank', u'Major_code', u'Major', u'Major_category', u'Total',
       u'Sample_size', u'Men', u'Women', u'ShareWomen', u'Employed',
       u'Full_time', u'Part_time', u'Full_time_year_round', u'Unemployed',
       u'Unemployment_rate', u'Median', u'P25th', u'P75th', u'College_jobs',
       u'Non_college_jobs', u'Low_wage_jobs'],
      dtype='object')
   Rank  Major_code                           Major Major_category  Total  \
0     1        2419           PETROLEUM ENGINEERING    Engineering   2339   
1     2        2416  MINING AND MINERAL ENGINEERING    Engineering    756   
2     3        2415       METALLURGICAL ENGINEERING    Engineering    856   

   Sample_size   Men  Women  ShareWomen  Employed      ...        Part_time  \
0           36  2057    282    0.120564      1976      ...              270   
1            7   679     77    0.101852       640      ...              170   
2            3   725    131    0.153037       648      ...              133   

   Full_time_year_round  Unemployed  Unemployment_rate  Median  P25th   P75th  \
0                  1207          37           0.018381  110000  95000  125000   
1                   388          85           0.117241   75000  55000   90000   
2                   340          16           0.024096   73000  50000  105000   

   College_jobs  Non_college_jobs  Low_wage_jobs  
0          1534               364            193  
1           350               257             50  
2           456               176              0  

[3 rows x 21 columns]

3: Summarizing Major Categories

In both of these datasets, majors are grouped into categories. There are multiple rows with a common value for Major_category but different values for Major. We would like to know the total number of people in each Major_category for both datasets.


In [4]:
def calculate_major_cat_totals(df):
  # Map each major category to the summed "Total" of all majors in it.
  counts_dictionary = {}
  for cat in df["Major_category"].unique():
    # Select the rows for this category and add up their totals.
    counts_dictionary[cat] = df.loc[df["Major_category"] == cat, "Total"].sum()
  return counts_dictionary

all_ages_major_categories = calculate_major_cat_totals(all_ages)
recent_grads_major_categories = calculate_major_cat_totals(recent_grads)

print all_ages_major_categories
print recent_grads_major_categories


{'Arts': 1805865L, 'Psychology & Social Work': 1987278L, 'Business': 9858741L, 'Industrial Arts & Consumer Services': 1033798L, 'Computers & Mathematics': 1781378L, 'Agriculture & Natural Resources': 632437L, 'Interdisciplinary': 45199L, 'Humanities & Liberal Arts': 3738335L, 'Engineering': 3576013L, 'Biology & Life Science': 1338186L, 'Health': 2950859L, 'Law & Public Policy': 902926L, 'Physical Sciences': 1025318L, 'Education': 4700118L, 'Communications & Journalism': 1803822L, 'Social Science': 2654125L}
{'Arts': 357130L, 'Psychology & Social Work': 481007L, 'Business': 1302376L, 'Industrial Arts & Consumer Services': 229792L, 'Computers & Mathematics': 299008L, 'Agriculture & Natural Resources': 79981L, 'Interdisciplinary': 12296L, 'Humanities & Liberal Arts': 713468L, 'Engineering': 537583L, 'Biology & Life Science': 453862L, 'Health': 463230L, 'Law & Public Policy': 179107L, 'Physical Sciences': 185479L, 'Education': 559129L, 'Communications & Journalism': 392601L, 'Social Science': 529966L}
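
The same per-category totals can probably be obtained without an explicit loop by using pandas' groupby. The following is a minimal sketch on a small made-up DataFrame; the `toy` data and the helper name `major_cat_totals_groupby` are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd

def major_cat_totals_groupby(df):
    # Sum the "Total" column within each "Major_category" group and
    # convert the resulting Series to a plain dict.
    return df.groupby("Major_category")["Total"].sum().to_dict()

# Made-up rows for illustration only (not values from the survey).
toy = pd.DataFrame({
    "Major": ["A", "B", "C"],
    "Major_category": ["Arts", "Arts", "Health"],
    "Total": [10, 20, 5],
})

toy_totals = major_cat_totals_groupby(toy)
```

Applied to all_ages or recent_grads, this should yield the same category-to-total mapping as calculate_major_cat_totals, since both sum Total over rows sharing a Major_category.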

4: Low Wage Jobs Rates

The press likes to talk about how many college grads are unable to get higher-wage, skilled jobs and end up working lower-wage, unskilled jobs instead. As a data person, it is your job to be skeptical of broad claims and to analyze the relevant data to obtain a more nuanced view. Let's run some basic calculations to explore that idea further.


In [5]:
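# Share of recent grads working low-wage jobs. Despite the variable name,
# the result is a proportion between 0 and 1, not a percentage.
low_wage_percent = recent_grads["Low_wage_jobs"].astype(float).sum() / recent_grads["Total"].sum()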

print low_wage_percent


0.0985254607612
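
The figure above pools all majors into one ratio. One way to get a more nuanced view is to compute the same ratio per Major_category. The sketch below does this on made-up numbers; the `toy` data and the helper name `low_wage_share_by_category` are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd

def low_wage_share_by_category(df):
    # Sum low-wage jobs and total graduates within each category,
    # then divide to get the low-wage share per category.
    sums = df.groupby("Major_category")[["Low_wage_jobs", "Total"]].sum()
    return sums["Low_wage_jobs"].astype(float) / sums["Total"]

# Made-up rows for illustration only (not values from the survey).
toy = pd.DataFrame({
    "Major_category": ["Arts", "Arts", "Engineering"],
    "Low_wage_jobs": [30, 10, 5],
    "Total": [100, 100, 100],
})

shares = low_wage_share_by_category(toy)

# Pooled share across all rows, mirroring the calculation in the cell above.
pooled = toy["Low_wage_jobs"].astype(float).sum() / toy["Total"].sum()
```

In the toy data the pooled share sits between the per-category shares, which illustrates how a single overall figure can hide differences between categories.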

5: Comparing Datasets

Both all_ages and recent_grads datasets have 173 rows, corresponding to the 173 college major codes. This enables us to do some comparisons between the two datasets and perform some initial calculations to see how similar or different the statistics of recent college graduates are from those of the entire population.
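
Before comparing row by row, it can be worth checking that the two datasets really do cover the same majors. The following is a minimal sketch of such a check on made-up frames; the helper name `same_majors` and the toy data are assumptions for illustration, and the choice of Major_code as the key is also an assumption.

```python
import pandas as pd

def same_majors(df_a, df_b, key="Major_code"):
    # True when both frames have the same number of rows and
    # exactly the same set of values in the key column.
    return len(df_a) == len(df_b) and set(df_a[key]) == set(df_b[key])

# Made-up frames for illustration only.
toy_a = pd.DataFrame({"Major_code": [1100, 1101, 1102]})
toy_b = pd.DataFrame({"Major_code": [1102, 1100, 1101]})
toy_c = pd.DataFrame({"Major_code": [1100, 1101, 2419]})

match_ab = same_majors(toy_a, toy_b)  # same codes, different order
match_ac = same_majors(toy_a, toy_c)  # one code differs
```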


In [6]:
# All majors, common to both DataFrames
majors = recent_grads['Major'].value_counts().index
# Despite the "emp" in their names, these count the majors for which
# each group has the lower unemployment rate.
recent_grads_lower_emp_count = 0
all_ages_lower_emp_count = 0

for major in majors:
  # Look up this major's unemployment rate in each dataset.
  recent_unemp = recent_grads["Unemployment_rate"][recent_grads["Major"] == major].values[0]
  all_unemp = all_ages["Unemployment_rate"][all_ages["Major"] == major].values[0]
  # Majors with equal rates are counted for neither group.
  if recent_unemp < all_unemp:
    recent_grads_lower_emp_count += 1
  elif recent_unemp > all_unemp:
    all_ages_lower_emp_count += 1

print "Recent grads fare better: ", recent_grads_lower_emp_count
print "All ages fare better: ", all_ages_lower_emp_count


Recent grads fare better:  43
All ages fare better:  128
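
The per-major lookup in the loop above can likely be replaced by a single merge followed by vectorized comparisons. The sketch below joins on Major_code rather than Major, which is a swapped-in choice and not what the notebook does. The helper name `compare_unemployment` and the toy frames are assumptions for illustration.

```python
import pandas as pd

def compare_unemployment(recent, overall):
    # Align the two datasets on Major_code; overlapping column names
    # get the suffixes "_recent" and "_all".
    merged = recent.merge(overall, on="Major_code", suffixes=("_recent", "_all"))
    recent_rate = merged["Unemployment_rate_recent"]
    all_rate = merged["Unemployment_rate_all"]
    # Count majors where each group has the strictly lower rate;
    # ties are counted for neither group.
    recent_lower = int((recent_rate < all_rate).sum())
    all_lower = int((recent_rate > all_rate).sum())
    return recent_lower, all_lower

# Made-up frames for illustration only (not values from the survey).
toy_recent = pd.DataFrame({
    "Major_code": [1, 2, 3],
    "Unemployment_rate": [0.05, 0.10, 0.07],
})
toy_all = pd.DataFrame({
    "Major_code": [1, 2, 3],
    "Unemployment_rate": [0.06, 0.08, 0.07],
})

counts = compare_unemployment(toy_recent, toy_all)
```

Because ties fall into neither count, the two numbers need not add up to the number of majors, which is consistent with the loop's if/elif logic.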